The Truth About Airline Statistics

by Darren Gidado

In [1]:
from PIL import Image # Library for importing images
Image.open('images/airline_cover.jpg')
Out[1]:

Introduction 🛫

What is the ASA?

ASA stands for 'American Statistical Association', ASA is the main professional organsiation for statisticians in the United States. The organization was formed in November 1839 and is the second oldest continuously operating professional society in the United States. Every other year, at the Joint Statistical Meetings, the Graphics Section and the Computing Section join in sponsoring a special Poster Session called The Data Exposition , but more commonly known as The Data Expo. All of the papers presented in this Poster Session are reports of analyses of a common data set provided for the occasion. In addition, all papers presented in the session are encouraged to report the use of graphical methods employed during the development of their analysis and to use graphics to convey their findings.

Quoted from citation: https://www.tandfonline.com/doi/abs/10.1198/jcgs.2011.1de

What is the ASA 2009 Data Expo?

The ASA Statistical Computing and Graphics Data Expo is a biannual data exploration challenge. Participants are challenged to provide a graphical summary of important features of the data. The task is intentionally vague to allow different entries to focus on different aspects of the data, giving the participants maximum freedom to apply their skills. The 2009 data expo consisted of flight arrival and departure details for all commercial flights on major carriers within the USA, from October 1987 to April 2008. This is a large dataset: there are nearly 120 million records in total, and takes up 1.6 gigabytes of space compressed and 12 gigabytes when uncompressed. The complete dataset and challenge are available on the competition website http://stat-computing.org/dataexpo/2009/.

Because the dataset is so large, we also provided participants introductions to useful tools for dealing with this scale of data: Linux command line tools, including sort, awk, and cut, and sqlite, a simple SQL database. Additionally, we provided pointers to supplemental data on airport locations, airline carrier codes, individual plane information, and weather. This dataset reports flights in the United States, including carriers, arrival and departure delays, and reasons for delays, from 1987 to 2008.

Quoted from citation: https://www.tandfonline.com/doi/abs/10.1198/jcgs.2011.1de

Project Aims

This dataset reports flights in the United States, including carriers, arrival and departure delays, and reasons for delays, from 1987 to 2008. For our analysis, we are going to be using the three year period between 1989 to 1991 as our sample dataset as the whole dataset is too large to practically observe with Jupyter. Our goal is to ask many questions such as:

  • How many flights are there?
  • Which airlines have the most flights?
  • What time of the week passengers fly the most?
  • How does season change the frequency and the destination of travel?
  • Which routes are the most popular?
  • Which routes and airports experience the most delays?
  • Which airports are the busiest in terms of inbound and outbound flights?
  • Which popular routes are delayed the most?
  • Which times are airports the busiest in terms of flights by each hour?
  • How does flight distance affect departure and arrival delays?
  • Which airline experiences the fewest delays?

Table column descriptions:

In [2]:
import pandas as pd
pd.read_csv('misc_data/master/variable-descriptions_master.csv')
Out[2]:
name description
0 Year 1987-2008
1 Month 12-Jan
2 DayofMonth 31-Jan
3 DayOfWeek 1 (Monday) - 7 (Sunday)
4 DepTime actual departure time (local, hhmm)
5 CRSDepTime scheduled departure time (local, hhmm)
6 ArrTime actual arrival time (local, hhmm)
7 CRSArrTime scheduled arrival time (local, hhmm)
8 UniqueCarrier unique carrier code
9 FlightNum flight number
10 TailNum plane tail number
11 ActualElapsedTime in minutes
12 CRSElapsedTime in minutes
13 AirTime in minutes
14 ArrDelay arrival delay, in minutes
15 DepDelay departure delay, in minutes
16 Origin origin IATA airport code
17 Dest destination IATA airport code
18 Distance in miles
19 TaxiIn taxi in time, in minutes
20 TaxiOut taxi out time in minutes
21 Cancelled was the flight cancelled?
22 CancellationCode reason for cancellation (A = carrier, B = weat...
23 Diverted 1 = yes, 0 = no
24 CarrierDelay in minutes
25 WeatherDelay in minutes
26 NASDelay in minutes
27 SecurityDelay in minutes
28 LateAircraftDelay in minutes

Step 1: Import 📚

In [3]:
# Import libraries
import pandas as pd
import numpy as np
import datetime as dt
import markdown
import seaborn as sb
import matplotlib.pyplot as plt
import matplotlib.ticker as tick
%matplotlib inline

# Display dataframe all columns
pd.set_option('display.max_rows', 636)
pd.set_option('display.max_columns', 636)

Step 2: Gather 💾

In [5]:
# flight data
df8991 = pd.read_csv('dataset/1989-1991.csv')
In [6]:
# misc_data
airports = pd.read_csv('misc_data/master/airports_master.csv')
carriers = pd.read_csv('misc_data/master/carriers_master.csv')
plane_data = pd.read_csv('misc_data/master/plane-data_master.csv')
var_desc = pd.read_csv('misc_data/master/variable-descriptions_master.csv')
In [9]:
from PIL import Image # Library for importing images
Image.open('images/Boeing_727-22_United_Airlines.jpg')
Out[9]:

Step 3: Univariate Exploration 📈

During this step, we will investigate distributions of individual variables to prepare for observing the relationships between variables.

3.1. Total Flights by Month

In [15]:
# Plot
flights_by_month.plot(kind = 'bar', figsize=(10, 5), rot=0, legend=None)

# Set title label
plt.title('Total Flights by Month', fontsize = 15)

# Set axis labels
plt.ylabel('', fontsize = 13)
plt.xlabel('', fontsize = 13)

# Ticks
ax = plt.gca()
ax.yaxis.set_major_formatter(tick.FuncFormatter(reformat_large_tick_values));

# Show
plt.tight_layout(pad=1)
plt.savefig('images/charts/total_flights_by_month.png')
plt.show()

Observation 1: For this dataset we have used the data between 1989-1991, therefore we grouped the data into monthly averages between those years. can see August has the most flights on average and Feburary had the least. This shows us that summer months have the most flights and winter months the least. Feburary had 12% less flights compared to the monthly average of 1,252,585.

3.2. Total Flights by Season

In [18]:
# Plot
flights_by_season.plot(kind = 'bar', figsize=(7, 5), rot=0, legend=None)

# Set title label
plt.title('Total Flights by Season', fontsize = 15)

# Set axis labels
plt.ylabel('Number of flights', fontsize = 13)
plt.xlabel('', fontsize = 13)

# Ticks
ax = plt.gca()
ax.yaxis.set_major_formatter(tick.FuncFormatter(reformat_large_tick_values));

# Show
plt.tight_layout(pad=1)
plt.savefig('images/charts/total_flights_by_season.png')
plt.show()

Observation 2: Summer has the most total flights but the difference between them is not as much as you might think. From my calculations, the percentage difference between the best and worst seasons is only 5%. Spring has a strong performance placing 2nd, Autumn 3rd and Winter a more distant 4th.

3.3. Total Flights by Day

In [23]:
# Plot
flights_by_day.reindex(index=flights_by_day.index[::-1]).plot(kind = 'barh', figsize=(10, 5), rot=0)

# Set title label
plt.title('Total Flights by Day', fontsize = 15)

# Set axis labels
plt.ylabel('', fontsize = 13)
plt.xlabel('', fontsize = 13)

# Ticks
ax = plt.gca()
ax.xaxis.set_major_formatter(tick.FuncFormatter(reformat_large_tick_values));
ax.axvline(flights_by_day['Flights'].mean(), color='red', linewidth=2, linestyle='dashed', label='Average')

# Legend
plt.legend(bbox_to_anchor=(1.2, 1), loc=1, borderaxespad=0.0)

# Show
plt.tight_layout(pad=1)
plt.savefig('images/charts/total_flights_by_day.png')
plt.show()

Observation 3: Monday has the most total flights with 2,211,445 but the difference between Monday and the other weekdays is rather negligable. Saturday shows a near 10% dip in flights compared to Monday while Sunday dips under 6% using the same comparison, this suggests passengers are flying more during the week than weekends. This chart shows a 10% variance between the day with the highest and lowest amount of flights, 3 years of data shows consistency in strong demand for flights regardless of the day of the week.

3.4. Total Flights by Hour

In [28]:
# Plot
flights_by_hour.plot(kind = 'bar', figsize=(10, 5), rot=0, legend=None)

# Set title label
plt.title('Total Flights by Hour', fontsize = 15)

# Set axis labels
plt.ylabel('', fontsize = 13)
plt.xlabel('', fontsize = 13)

# Ticks
ax = plt.gca()
ax.yaxis.set_major_formatter(tick.FuncFormatter(reformat_large_tick_values));

# Show
plt.tight_layout(pad=1)
plt.savefig('images/charts/total_flights_by_hour.png')
plt.show()

Observation 4: 08:00 is the most popular time to fly with 1,103,998 flights according to departure time data from 1989-1991. We can see that there is a dip at 10:00, perhaps due to scheduling or other external reasons. The next dip we see on the chart is at 14:00 hours before recovering again until 18:00 where the flights steadily drop off. 03:00 recorded the lowest number of flights at only 2,593, that is a -426x difference in total flights vs 08:00. Flights between midnight and 05:00 are lower due to many factors including being the time people are generally sleeping but also noise restrictions.

3.5. Flight Duration in Minutes

In [41]:
# Plot
flights_by_duration.plot(kind = 'area', figsize=(7, 5), rot=0, legend=None)

# Set title label
plt.title('Total Flights by Duration', fontsize = 15)

# Set axis labels
plt.ylabel('', fontsize = 13)
plt.xlabel('Minutes', fontsize = 13)

# Ticks
ax = plt.gca()
ax.yaxis.set_major_formatter(tick.FuncFormatter(reformat_large_tick_values));

# Show
plt.tight_layout(pad=1)
plt.savefig('images/charts/total_flights_by_duration.png')
plt.show()

Observation 5: Our area graph above shows us flights average around 110 minutes while the median flight duration is 91 minutes. The longest flight duration was 1883 minutes which equates to over 31 hours! The conclusion is 98.1% of flights are under 300 minutes and the remaining 1.9% of flights are generally under 300 minutes or 5 hours.

3.6. Flight Distance in Miles

In [46]:
# Plot
plt.hist(data = flights_by_distance_idx, x = 'FlightDistance', bins = 25) #, y = 'Flights'

# Set title label
plt.title('Total Flights by Distance', fontsize = 15)

# Set axis labels
plt.ylabel('', fontsize = 13)
plt.xlabel('Distance (Miles)', fontsize = 13)

# Ticks fixed for histogram
ax = plt.gca()
ax.yaxis.set_major_formatter(tick.FuncFormatter(lambda x, pos: '{:,.0f}'.format(x/1) + 'K'))

# Show
plt.tight_layout(pad=1)
plt.savefig('images/charts/total_flights_by_distance.png')
plt.show()

Observation 6: Most flights range from around 100 to 1000 miles and steadily decline after, with the average being 632 miles. 2500-3000 miles seems to be the point flights most drastically reduce. Statistically 0.1% of flights go over 2800 miles.

3.7. Flights by Carrier

In [58]:
# Variables
labels = [' ',' ',' ',' ',' ',' ',' ',' ',' ',' ']
explode = [0.1, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01]

# Plot
carriers_top_10.plot(kind='pie', figsize=(7, 5), rot=0, subplots=True, labels=labels, legend=True, 
                     cmap='jet', autopct='%1.1f%%', pctdistance=1.15, explode=explode)

# Legend
carrier_list = list(carriers_top_10_idx2['description'])
plt.legend(labels = carrier_list, bbox_to_anchor=(1.55, 1), loc=1, borderaxespad=0.0)

# Set title label
plt.title('Total Flights by Carrier', loc='center', fontsize = 15)

# Set axis labels
plt.ylabel('', fontsize = 13)
plt.xlabel('', fontsize = 13)

# Ticks
ax = plt.gca()
ax.yaxis.set_major_formatter(tick.FuncFormatter(reformat_large_tick_values));

# Show
plt.tight_layout(pad=1)
plt.savefig('images/charts/total_flights_by_carrier.png')
plt.show()

Observation 7: The most popular carrier is US Airways Inc with 2,575,160 flights between 1989-1991, that represents an 18.1% share of all flights from our data. Delta Air Lines Inc. is a close 2nd place with 2,430,469 or a 5.6% difference from 1st place with a 17% share of the market. Those two airlines would have gone toe to toe dominating aviation at that time. However, American Airlines Inc. features in 3rd place with 2,084,513 flights with a strong 14.7% market share. The gap widens to a 14% difference in total flights vs 1st place. This trend continues as the other airlines share smaller and smaller stakes in the market.

3.8. Flight Distance by Day

In [62]:
# suppress warnings from final output
import warnings
warnings.simplefilter("ignore")

# Figsize
plt.figure(figsize = [15, 6])
plt.suptitle('Elapsed Time and Distance vs Day', fontsize=15)

# Plot 1
plt.subplot(1, 2, 1)
weekday['ActualElapsedTime'].rename(index = idx, inplace = False).plot(kind = 'line', title='Actual Elapsed Time', legend=None)
plt.xlabel('')
plt.ylabel('')

# Plot 2
plt.subplot(1, 2, 1)
weekday['CRSElapsedTime'].rename(index = idx, inplace = False).plot(kind = 'line', title='Actual Elapsed Time vs CRS Elapsed Time (Min)', legend=None)
plt.legend()
plt.xlabel('')
plt.ylabel('')

# Plot 3
plt.subplot(1, 2, 2)
weekday['Distance'].rename(index = idx, inplace = False).plot(kind = 'line', title='Distance (Miles)', legend=None)
plt.xlabel('')
plt.ylabel('')

# Show
fig = ax.get_figure()
fig.tight_layout()
fig.subplots_adjust(top=2)
plt.savefig('images/charts/time_vs_distance_vs_day.png')
plt.show()

Observation 8: Flight time seems to increase as the week passes. The weekend has the longest flights on average, indicating passengers might be travelling on weekend holidays but commute on weekdays. We also know from other data that fewer flights take place during the weekend but also delays are lower while flight duration and distance is longer on average.

3.9. Delay by Origin and Destination

In [70]:
# Plot
#fig, (ax1, ax2) = plt.subplots(1,2, figsize=(15,6), gridspec_kw={'width_ratios':[1,1]})
#plt.suptitle('Delay by Destination vs Arrival')

# Chart 1
plt.subplot(1, 2, 1)
origin['DepDelay'].sort_values(ascending=False).iloc[0:10].plot(
    kind = 'Barh', figsize=(10, 5), legend=None, title='Departure Airports by Delay')

# Chart 2
plt.subplot(1, 2, 2)
dest['ArrDelay'].sort_values(ascending=False).iloc[0:10].plot(
    kind = 'Barh', figsize=(10, 5), legend=None, title='Arrival Airports by Delay', label='Delay in minutes')

# Show
plt.tight_layout(pad=1)
plt.savefig('images/charts/delay_by_origin_dest.png')
plt.show()
In [71]:
# Plot
# Chart 1
plt.subplot(1, 2, 1)
aphigh_delay.plot(
    kind = 'Barh', figsize=(10, 5), legend=None, title='Top 10 Airports by Highest Total Delay')

# Chart 2
plt.subplot(1, 2, 2)
aplow_delay.plot(
    kind = 'Barh', figsize=(10, 5), legend=None, title='Top 10 Airports by Lowest Total Delay')

# Show
plt.tight_layout(pad=1)
plt.savefig('images/charts/delay_by_airport.png')
plt.show()
In [73]:
# Airport name filter
airports[airports['iata'] == 'PSE']
Out[73]:
iata airport city state country lat long
2674 PSE Mercedita Ponce PR USA 18.008303 -66.563012

Step 4: Bivariate Exploration 📈

4.1. Airports by Traffic

In [89]:
import seaborn as sns
fig, (ax1, ax2, axcb) = plt.subplots(1,3, figsize=(15,6), gridspec_kw={'width_ratios':[1.05,1,0.08]})

# Plot
fig.suptitle('Top 10 Airports by Flights', fontsize=15)
ax1.title.set_text('Origin Airports')
ax2.title.set_text('Destination Airports')
sns.heatmap(top_10_origin, cmap='Purples', linewidths=5, cbar=None, ax=ax1)
sns.heatmap(top_10_dest, cmap='Purples', linewidths=5, cbar_ax=axcb, ax=ax2)

# Show
#plt.tight_layout(pad=3)
plt.savefig('images/charts/top_10_airports_by_flights.png')
plt.show()
In [92]:
# Plot size
fig, ax = plt.subplots(figsize=(10,8))
ax.set_title('Top 10 Airports by Total Flights', fontsize=15)
sns.heatmap(top_10_airports.iloc[0:10,0:12], cmap='Blues', linewidths=5, cbar='Blues')

# Show
#plt.tight_layout(pad=3)
plt.savefig('images/charts/top_10_airports_by_total_flights.png')
plt.show()

Observation 10: The top 10 airports by total inbound and outbound flights is topped by Chicago O'Hare International in 1st place with 1,565,539 flights. 2nd place is William B Hartsfield-Atlanta International with 1,340,568, and close 3rd is Dallas-Fort Worth International with 1,339,165. During the summer Chicago O'Hare International is very busy with strong demand throughout the year. Atlanta International follows a similar trend but Dallas-Fort Worth International show a weak January and Feburary in terms of total flights.

In [97]:
# Creating a combo chart
fig, ax1 = plt.subplots(figsize=(10,6))
color = 'tab:green'

# Creating Bar Plot
ax1.set_title('Airlines by Month vs Average Distance', fontsize=15)
ax1 = sns.barplot(x='Month', y='CarrierCount', data=carriers_month_v_distance.reset_index(), palette='summer')
ax1.set_xlabel('Month', fontsize=13)
ax1.set_ylabel('Carriers', fontsize=13, color=color, weight='bold')
ax1.tick_params(axis='y')

# Sharing the same x-axis
ax2 = ax1.twinx()
color = 'tab:red'

# Creating Line Plot
ax2 = sns.lineplot(x='Month', y='Distance', data=carriers_month_v_distance.reset_index(), 
                   sort=False, color=color, linewidth = 2, marker='o')
ax2.set_ylabel('\nAverage Distance (Mi)', fontsize=13, color=color, weight='bold')
ax2.tick_params(axis='y', color=color)

# Show
#plt.tight_layout(pad=3)
plt.savefig('images/charts/airlines_by_month_vs_distance.png')
plt.show()

Observation 11: On average, we can count around 12-14 carriers on any given month but the chart shows us from September the amount of carriers reduces into December suggesting reduced holidays. The opposite seems true when we look at average distance travelled by month. This number is increaing from 620 miles on average to 640 miles suggesting fewer longer flights. This correlation suggests perhaps passengers are flying less often but further in the winter than the summer. This could mean going from a cold country to a hotter country or going to see family for the festive season in December.

4.3. Average Flight Delay vs Distance

In [101]:
fig, (ax1, ax2) = plt.subplots(1,2, figsize=(15,6), gridspec_kw={'width_ratios':[1,1]})

# Plot
fig.suptitle('Daily Avg. Flight Delay vs Distance', fontsize=15)
sb.regplot(data = delay_v_distance, x = 'DepDelay', y = 'Distance', ax=ax1)
sb.regplot(data = delay_v_distance, x = 'ArrDelay', y = 'Distance', ax=ax2)

# Labels
ax1.title.set_text('')
ax1.set_xlabel('Departure Delay (Min)')
ax1.set_ylabel('Distance (Mi)')

ax2.title.set_text('')
ax2.set_xlabel('Arrival Delay (Min)')
ax2.set_ylabel('')

# Show
#plt.tight_layout(pad=3)
plt.subplots_adjust(top=0.9)
plt.savefig('images/charts/delay_vs_distance.png')
plt.show()

Observation 12: There doesn't seem to be a clear correlation between the distance and departure or arrival delay but we can clearly see one when we add a line of best fit. We can clearly observe a positive correlation in both plots indicating as the flight distance increases the greater chance of delay increases too. Please keep in mind that we are using a daily average here to plot this data because millions of data points would not be clear.

Step 5: Univariate Exploration 📈

5.1. Routes by Flights, Airline and Average Delay

In [124]:
# Colours (link: https://coolors.co/f94144-f3722c-f8961e-f9844a-f9c74f-90be6d-43aa8b-4d908e-577590-277da1)
colors = ["#f94144","#f3722c","#f8961e","#f9844a","#f9c74f","#90be6d","#43aa8b","#4d908e","#577590","#277da1","#284b63","#463f3a"] # 12 Airlines

# Plot
carrier_ra.unstack(level=0).plot(kind='barh', figsize=(13,30), color=colors, subplots=False)

# Title
plt.title('Routes by Flights, Airline, and Average Delay', fontsize=15)

# Label
plt.xlabel('Delay (Min)')

# Legend
carrier_list = list(carrier_ra.reset_index()['Carrier'].unique())
long_carrier_list = list(carrier_ra_idx.reset_index()['description'].unique())
plt.legend(long_carrier_list)

# Show
#plt.tight_layout(pad=3)
plt.savefig('images/charts/routes_by_flights_airline_average_delay.png')
plt.show()
In [125]:
# Plot
g = sns.catplot(data=carrier_ra_idx, kind='swarm', x='AverageDelay', y='Route', hue='description', height=10, s=9)
g._legend.set_title("Carriers")

# Show
#plt.tight_layout(pad=3)
plt.savefig('images/charts/routes_by_flights_airline_average_delay_2.png')
plt.show()

Observation 13: LAS - LAX shows us many interesting insights. For example, American Airlines has the fewest delays on this route with an average delay of just -0.57 minutes, this means they are actually early. Midway Airlines Inc. has the most delays totalling an average of 118.2 minutes on this route which is astonishing. Things don't fair too well for Midway on the return leg with an average delay of 54.8 minutes. For PHX - LAX, Trans World Airways LLC tops the charts with an average of 76.8 minutes and Delta Air Lines Inc. has the fewest delays with just 13.1 minutes for delays on average. For our final conclusion, we can say the average delay time is just 18.36 minutes if all carriers and routes are considered which is pretty good. The outlier here is the maximum delay of 118 minutes which is 644% of the delay average of 18.36 minutes.

5.2. Top 10 Airlines by Average Delay

In [130]:
# Plot
top_10_carriers_by_delay.plot(kind='barh', figsize=(7,10)) 

# Title
plt.title('Airlines by Average Delay', fontsize=15)

# Label
plt.xlabel('Delay in Minutes (Lower is better)', fontsize=13)

# Plot
plt.savefig('images/charts/airlines_by_average_delays.png')
plt.show()

Observation 14: Northwest Airlines has the lowest average delay on flights for departure and arrival times combined. Piedmont Aviation is the least punctual carrier with almost 26 minutes average delay times with a 52% increased average delay time vs Northwest Airlines. Another trend we can notice is that departure delays are 5% higher than arrival delays on average suggesting various factors cause this like aircraft prepeation and passengers.

Step 6: Random Exploration 🔀

6.1. Plane Data

We can round off our exploration by looking at the plane_data table. There we can find data about plane types, manufacturers, models, engine types and issue dates.

In [135]:
# Most popular aircraft manufacturers between 89-91
plane_data_8991['manufacturer'].value_counts().iloc[0:10].sort_values(ascending=True).plot(kind='barh', figsize=(10,10))

# Title
plt.title('Most Popular Aircraft Manufacturers from 1989-1991', fontsize=15)

# Label
plt.xlabel('Total Aircraft', fontsize=13)

# Plot
plt.savefig('images/charts/top_manufacturers.png')
plt.show()
In [141]:
# Most popular engine types between 89-91
plane_data_8991['engine_type'].value_counts().iloc[0:10].sort_values(ascending=True).plot(kind='barh', figsize=(10,10))

# Title
plt.title('Most Popular Engine Types from 1989-1991', fontsize=15)

# Label
plt.xlabel('Total Engines', fontsize=13)

# Plot
plt.savefig('images/charts/top_engine_types.png')
plt.show()

Observation 1: Boeing is by far the most popular aircraft manufacturer between 1989 and 1991 with 267 aircraft totalling 65% of the market. The top 6 is rounded out by McDonnel Douglas and Airbus Industrie while SAAB-Scania and MD Corporation rank 5th and 6th. Interestingly, despite Boeing's dominance the most popular aircraft is a McDonnel Douglas MD-88 with 65 aircraft recorded over our 3 year sample. The rest of the top 10 is dominated by 7 Boeing aircraft, 3rd spot is filled by a McDonnel Douglas DC-9 and 10th is taken by an Airbus A320-211. Finally on engine types, the most common engine type is Turbo-Fan with 234 records, followed by Turbo-Jet with 175 then Turbo-Prop with just 2 records.

Step 7: Conclusion 📝

In [143]:
Image.open('images/concorde.jpg')
Out[143]:

7.1. Interesting Findings

From our data, we can assess passengers fly more during the summer and least during the winter and more during midweek versus the weekend. The most popular time to travel for passengers is in the morning around 8:00. Most flights average a 110 minute flight time and an average distance is 632 miles. Interestingly, the average distance for flights increases during the winter months suggesting that passengers commute further to go on distant holidays or visit family and friends across the country. Airline carriers enjoy a competitive market with no more than 36% difference in total flights between carriers. Regarding carriers, we observed that flight delays increase with flight distance. Chicago O'Hare International is the busiest airport with over 1.5 million flights recorded over the 3 year period.

7.2. Project Answers

How many flights are there?\ 15,031,014 flights.

Which airlines have the most flights?\ US Airways Inc.

What time of the week passengers fly the most?\ Monday to Friday.

How does season change the frequency and the destination of travel?\ Summer increases the frequency of average flights whilst winter decreases it.

Which routes are the most popular?

  1. SFO - LAX | 69,180 Flights
  2. LAX - SFO | 68,754 Flights
  3. LAX - PHX | 39,321 Flights

LAX: Los Angeles Intl, SFO: San Fransisco Intl, PHX: Phoenix Sky Harbor Intl.

Which routes and airports experience the most delays?\ GUC - HDN Gunisson to Haydon | 1980 minutes on average for route.\ PSE Mercedita Airport | 53.25 minutes delay on average is the highest for an airport.\ BQN Rafael Hernández Airport | -10.35 minutes delay (early) on average is lowest for an airport.

Which airports are the busiest in terms of inbound and outbound flights?\ Chicago O'Hare International.

Which popular routes are delayed the most?\ LAX - SFO | 68,754 Flights, 18.97 minutes delay on average.\ PHX - LAX | 38,756 Flights, 16.18 minutes delay on average.\ PHX - LAS | 30,778 Flights, 15.85 minutes delay on average.

Which times are airports the busiest in terms of flights by each hour?\ 08:00 - 09:00 has over 1 million flights during the morning.

How does flight distance affect departure and arrival delays?\ Flight distance increases total departure and arrival delays with 0.221398 positive correlation.

Which airline experiences the fewest delays?\ Northwest Airlines.

7.3. Key Insights for Presentation

For the presentation, I decided to look at flight counts using simple vertical and horizontal bar charts before moving on to more complex plots for deeper analysis. I used an area plot for flight duration because it looked visually clearer than using a line chart. I used a histogram for the flight distance plot to keep things interesting since my plan was to use many different plot types for the slides.

For total flights by carrier I decided to use a colourful pie chart which should catch the audiences attention since unique and vibrant colour selection is important. I also decided to explode the largest slice of the pie chart. For the delay section I used a combination of line charts and horizontal bar charts since it was the best way to show bivariate data that relies on categorical data. Next, I introduced a third variable which was categorical so I decided to use a heatmap. Heatmaps are great at showing multiple dimensions of data in a fun way, in our case it was for airport name, flight totals, and month. The darker the colour the more flights per grid box.

I used a combination chart to plot airlines vs month vs average distance. This allowed me to show multiple variables co-existing in the same axis clearly. Average flight delay vs distance used the scatterplot, another plot I was keen to try as it engages the audience with raw individual data points. This allowed us to use a line of best fit to intercept the points to see if there was a correlation visually. For routes by flights vs airline average delay it was the perfect time to use a grouped bar chart, I found some interesting colour palettes for my chart from coolor.co. I could plot multiple airlines for each route on the same bar axis and take an average of them if we wanted to. I used a swarm plot on the same data for another angle to see if we could find out more. Average airlines by delay had a similar treatment with the grouped bar chart instead replacing routes with airline carriers, this time I used a more modest colour palette. The random exploration at the end used simple bar charts again as they are a quick and easy to explore data.

References

Getting csv subdirectories: https://perials.com/getting-csv-files-directory-subdirectories-using-python/ \ Combine csv files: https://www.freecodecamp.org/news/how-to-combine-multiple-csv-files-with-8-lines-of-code-265183e0854/ \ Fix .info() display: https://www.geeksforgeeks.org/python-pandas-dataframe-info/ \ CMD: https://www.tomnash.eu/how-to-combine-multiple-csv-files-into-one-using-cmd/ \ Reset and drop index: https://stackoverflow.com/questions/39616424/pandas-reset-index-creating-level0-column \ Between values 1: https://stackoverflow.com/questions/31617845/how-to-select-rows-in-a-dataframe-between-two-values-in-python-pandas \ Between values 2: https://www.geeksforgeeks.org/python-pandas-series-between/ \ String Pad or str.pad: https://www.geeksforgeeks.org/python-pandas-series-str-pad/ \ String Slicing: https://geekflare.com/python-remove-last-character/ \ Datetime without date: https://stackoverflow.com/questions/32375471/pandas-convert-strings-to-time-without-date# \ Replace values: https://datatofish.com/replace-values-pandas-dataframe/ \ Dates to seasons: https://stackoverflow.com/questions/22615288/group-data-by-seasons-using-python-and-pandas \ Rename Index: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename_axis.html \ Sort Index by list: https://stackoverflow.com/questions/45389126/sort-index-by-list-python-pandas \ Loc: https://www.geeksforgeeks.org/python-pandas-dataframe-loc/ \ DT Day of Week: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.dt.dayofweek.html \ Pandas Interate Dataframe: https://pythonbasics.org/pandas-iterate-dataframe/ \ Strip strings inside bracket: https://stackoverflow.com/questions/20894525/how-to-remove-parentheses-and-all-data-within-using-pandas-python \ Adding colour to pie charts: https://www.pythonprogramming.in/how-to-pie-chart-with-different-color-themes-in-matplotlib.html \ Using axvline to add average line: https://stackoverflow.com/questions/45724329/let-axvline-end-at-certain-y-value \ Plot MultiIndex = https://stackoverflow.com/questions/54451127/creating-a-heatmap-from-a-pandas-multiindex-series \ Modify sns heatmaps: https://stackoverflow.com/questions/42712304/seaborn-heatmap-subplots-keep-axis-ratio-consistent \ Marker styles: https://matplotlib.org/3.1.1/api/markers_api.html \ Sum text columns: https://cmdlinetips.com/2018/11/how-to-join-two-text-columns-into-a-single-column-in-pandas/ \ Melting Dataframe: https://pandas.pydata.org/docs/reference/api/pandas.melt.html \ Concat Dataframe: https://www.geeksforgeeks.org/python-pandas-merging-joining-and-concatenating/ \ Change Seaborn plot size: https://stackoverflow.com/questions/31594549/how-do-i-change-the-figure-size-for-a-seaborn-plot \ Rename Seaborn legend: https://stackoverflow.com/questions/53116532/modify-seaborn-line-relplot-legend-title \ Rename Columns: https://chartio.com/resources/tutorials/how-to-rename-columns-in-the-pandas-python-library/ \ Plot 3 variable bar chart: https://stackoverflow.com/questions/42128467/matplotlib-plot-multiple-columns-of-pandas-data-frame-on-the-bar-chart \ Filter between two values: https://stackoverflow.com/questions/31617845/how-to-select-rows-in-a-dataframe-between-two-values-in-python-pandas/40442778 \ Drop rows matching string: https://stackoverflow.com/questions/28679930/how-to-drop-rows-from-pandas-data-frame-that-contains-a-particular-string-in-a-p \ Percentage of value_counts = https://stackoverflow.com/questions/14281871/given-a-pandas-series-that-represents-frequencies-of-a-value-how-can-i-turn-tho

In [ ]: